As usual, we first check for and load our required packages.
# Check if packages are installed, if not install.
if(!require(here)) install.packages('here') #checks if a package is installed and installs it if required.
if(!require(tidyverse)) install.packages('tidyverse')
if(!require(ggplot2)) install.packages('ggplot2')
library(here) #loads in the specified package
library(tidyverse)
library(ggplot2)
This week, we are adding to our data analysis toolkit with a between-groups analysis, using an independent samples t-test.
We observed last week how mood impacts active social media behaviour. However, that is not the only factor that influences social media use. For example, Sapienza et al. (2023) found that people in rural areas are more likely to use their smartphone for social media and gaming, whereas urban dwellers are more likely to use their phone for navigation and business.
However, we do not know if people living in urban and rural areas
engage with social media differently, regardless of how long they spend
on their chosen platforms. Today we will address this question using the
urban, good_mood_likes,
bad_mood_likes, and followers variables.
What do you think? Will urban and rural dwellers engage differently with social media? Will there be a difference in the number of likes made by people living in urban vs rural areas? Or in the number of followers people have in urban vs rural areas?
Today we will be averaging across mood to get the number of likes in
general for urban and rural dwellers. This means we first need to create
a new variable called likes which is the average of the
likes in a good and bad mood.
We first load in our PSYC2001_social-media-data.csv
dataset.
social_media <- read.csv(file = here("Data","PSYC2001_social-media-data.csv")) #reads in CSV files
Are you able to fill in the code below using the mutate()
function to create this new ‘likes’ variable?
social_media_likes <- social_media %>%
mutate(likes =(bad_mood_likes + good_mood_likes)/2 ) %>% #create a new column with specified values
select(id, urban, likes, followers) #keep specified columns in dataframe
head(social_media_likes)
## id urban likes followers
## 1 S1 1 34.65 173.3
## 2 S2 1 47.15 144.3
## 3 S3 1 48.45 76.5
## 4 S4 1 29.55 171.7
## 5 S5 1 44.75 109.5
## 6 S6 1 23.55 157.5
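As a small illustration of what mutate() is doing here, the sketch below builds a made-up three-row data frame (the object name toy and all of its values are invented for this example, not taken from the course data) and averages the two mood columns. It also shows that rowMeans() gives the same answer in base R.

```r
library(dplyr)

# Toy data: invented values, NOT the real PSYC2001 dataset
toy <- data.frame(
  id = c("A", "B", "C"),
  good_mood_likes = c(40, 50, 30),
  bad_mood_likes  = c(20, 30, 10)
)

# Average the two mood columns row by row
toy_likes <- toy %>%
  mutate(likes = (bad_mood_likes + good_mood_likes) / 2)

# Base R equivalent: mean across the two columns for each row
base_likes <- rowMeans(toy[, c("good_mood_likes", "bad_mood_likes")])

print(toy_likes)
print(all.equal(toy_likes$likes, unname(base_likes)))
```

Either approach works; mutate() is used in this tutorial because it chains neatly with select() through the pipe.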
Now that we have this object, it is important to check the format of
the data. Let's use the str() function that we learned about
in the second tutorial to do this.
str(social_media_likes) #provides a summary of the data structure.
## 'data.frame': 60 obs. of 4 variables:
## $ id : chr "S1" "S2" "S3" "S4" ...
## $ urban : int 1 1 1 1 1 1 1 1 1 1 ...
## $ likes : num 34.6 47.1 48.5 29.5 44.8 ...
## $ followers: num 173.3 144.3 76.5 171.7 109.5 ...
First, we can see that the values in urban are
coded as either 1 (urban) or 2 (rural). Let's change this so that instead
of numbers we use the actual descriptions, urban and rural. To
do this we will use the mutate() function with the
case_when() function, which replaces (or creates) specific
values in a variable with new ones.
social_media_likes <- social_media_likes %>%
mutate(urban = case_when(urban == 1 ~ "urban", urban == 2 ~ "rural")) #case_when uses if_else logic to replace values with specified values if the cases match.
str(social_media_likes)
## 'data.frame': 60 obs. of 4 variables:
## $ id : chr "S1" "S2" "S3" "S4" ...
## $ urban : chr "urban" "urban" "urban" "urban" ...
## $ likes : num 34.6 47.1 48.5 29.5 44.8 ...
## $ followers: num 173.3 144.3 76.5 171.7 109.5 ...
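One thing worth knowing about case_when() is that any value matching none of the conditions becomes NA. The sketch below uses an invented toy data frame with a deliberately invalid code (3) to show this; the column name urban_label is made up for the example.

```r
library(dplyr)

# Toy codes: 1 and 2 are the documented codes, 3 is a deliberately invalid one
toy <- data.frame(urban = c(1, 2, 1, 3))

toy <- toy %>%
  mutate(urban_label = case_when(
    urban == 1 ~ "urban",
    urban == 2 ~ "rural"
  ))

print(toy)

# Count the labels, including any NA produced by unmatched codes
print(table(toy$urban_label, useNA = "ifany"))
```

Tabulating the new labels like this is a quick way to confirm that every row was recoded as intended.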
We can see that urban is now classed as a
chr (character), but we will eventually need to split our
graphs by urban. This means that urban should
be a factor instead. We can change this easily by using
as.factor() within the mutate() function. The
as.factor() function is used to convert other data types to
factors!
social_media_likes <- social_media_likes %>%
mutate(urban = as.factor(urban))
str(social_media_likes)
## 'data.frame': 60 obs. of 4 variables:
## $ id : chr "S1" "S2" "S3" "S4" ...
## $ urban : Factor w/ 2 levels "rural","urban": 2 2 2 2 2 2 2 2 2 2 ...
## $ likes : num 34.6 47.1 48.5 29.5 44.8 ...
## $ followers: num 173.3 144.3 76.5 171.7 109.5 ...
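The str() output shows urban stored as 2 2 2 ... because a factor keeps integer codes that point to its levels, and as.factor() sorts those levels alphabetically ("rural" before "urban"). The sketch below uses an invented character vector to show this, and how factor() with an explicit levels argument can set a different order if one is ever needed.

```r
# Toy character vector (invented values)
area <- c("urban", "rural", "urban", "rural")

# as.factor() sorts the levels alphabetically
area_default <- as.factor(area)
print(levels(area_default))

# factor() with an explicit levels argument lets us choose the order
area_ordered <- factor(area, levels = c("urban", "rural"))
print(levels(area_ordered))

# The underlying integer codes follow the level order
print(as.integer(area_default))
print(as.integer(area_ordered))
```

The level order matters later on, because it decides which group t.test() treats as the first group and the order of groups on a plot axis.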
The data is now in a format that lets us easily visualise it and conduct our statistical tests. Well done!
Figure 1: What it feels like teaching this section
We’re now going to look at the data in 2 ways. First, we’re going to
look at how the data is distributed across all participants, so that we
can check if the data meets our assumptions about normality. Second, we
are going to plot our dependent variables (likes,
followers) by group, to gain a visual understanding for
what group differences might look like, if they exist.
Are you able to create a density plot for likes and
another for followers? Note that we use a new argument here,
linewidth, to control the thickness of the density line.
social_media_likes %>%
ggplot(aes(x = likes)) +
geom_density(linewidth = 2, colour = "blue") + #the argument linewidth is used to alter the size of the density line.
labs(x = "Number of Likes", y = "Density") +
theme_classic()
social_media_likes %>%
ggplot(aes(x = followers)) +
geom_density(linewidth = 2, colour = "orange") +
labs(x = "Number of followers", y = "Density") +
theme_classic()
Do likes and
followers look normally distributed to you? Why might the
data be shaped how they are for each variable?
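A rough numerical companion to eyeballing a density plot is to compare the mean and the median: in a roughly symmetric distribution they sit close together, whereas a long right tail pulls the mean above the median. The sketch below uses simulated data only (the distributions and their parameters are invented for illustration) and is not a formal test of normality.

```r
set.seed(123) # makes the simulated numbers reproducible

# Simulated data: one symmetric variable and one right-skewed variable
symmetric <- rnorm(1000, mean = 50, sd = 10)
skewed    <- rexp(1000, rate = 1 / 50)

summary_table <- data.frame(
  shape  = c("symmetric", "right-skewed"),
  mean   = c(mean(symmetric), mean(skewed)),
  median = c(median(symmetric), median(skewed))
)

print(summary_table)
```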
Now we are going to make some density plots and boxplots, split by the urban factor, so that we can see the group differences.
Are you able to help with this?
Hint: use the colour argument in
aes() to split the plot by urban.
social_media_likes %>%
ggplot(aes(x = likes, colour = urban)) +
geom_density(linewidth = 2) +
labs(x = "Number of Likes", y = "Density") +
scale_colour_manual(values = c(rural = "purple", urban = "green")) + #manually define colours of specific parts of a graph
theme_classic()
social_media_likes %>%
ggplot(aes(x = followers, colour = urban)) +
geom_density(linewidth = 2) +
labs(x = "Number of Followers", y = "Density") +
scale_colour_manual(values = c(rural = "purple", urban = "green")) + #manually define colours of specific parts of a graph
theme_classic()
Note the use of scale_colour_manual(). This allows you to manually define the
colours of specific parts of a graph. Here we have used it to define
colours for specific groups.
social_media_likes %>%
ggplot(aes(y = likes, colour = urban, x = urban)) +
geom_boxplot() +
labs(x = " ", y = "Number of Likes") +
scale_colour_manual(values = c(rural = "purple", urban = "green")) + #manually define colours of specific parts of a graph
theme_classic()
social_media_likes %>%
ggplot(aes(y = followers, colour = urban, x = urban)) +
geom_boxplot() +
labs(x = "", y = "Number of Followers") +
scale_colour_manual(values = c(rural = "purple", urban = "green")) + #manually define colours of specific parts of a graph
theme_classic()
What do the plots suggest about group differences in likes and followers? Is
this in line with your predictions from activity 1? Are there any caveats
or reasons to be cautious about your interpretations?
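It can help to put numbers next to the boxplots before interpreting them. The sketch below computes the group size, mean, median and standard deviation per group with group_by() and summarise(). It runs on simulated data (the group means and standard deviations are invented), so the values will not match the course dataset.

```r
library(dplyr)

set.seed(1) # makes the simulated numbers reproducible

# Simulated data: the group means and SDs are invented
toy <- data.frame(
  urban = factor(rep(c("rural", "urban"), each = 30)),
  likes = c(rnorm(30, mean = 50, sd = 12), rnorm(30, mean = 40, sd = 12))
)

# One row of descriptive statistics per group
group_summary <- toy %>%
  group_by(urban) %>%
  summarise(
    n            = n(),
    mean_likes   = mean(likes),
    median_likes = median(likes),
    sd_likes     = sd(likes)
  )

print(group_summary)
```

Swapping toy for social_media_likes should give the equivalent table for the real data.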
We now want to learn whether we have evidence for differences between urban and rural dwellers on the ‘likes’ and ‘followers’ variables.
Can you work out how to perform an independent samples t-test for these variables?
Hint: recall the t.test() function from the
tutorial last week, and pay attention to the ‘paired’ argument!
t.test(formula = likes ~ urban, data = social_media_likes, var.equal = TRUE, paired = FALSE)
##
## Two Sample t-test
##
## data: likes by urban
## t = 3.2184, df = 58, p-value = 0.002112
## alternative hypothesis: true difference in means between group rural and group urban is not equal to 0
## 95 percent confidence interval:
## 4.273656 18.336344
## sample estimates:
## mean in group rural mean in group urban
## 52.09333 40.78833
t.test(formula = followers ~ urban, data = social_media_likes, var.equal = TRUE, paired = FALSE)
##
## Two Sample t-test
##
## data: followers by urban
## t = -2.8182, df = 58, p-value = 0.006595
## alternative hypothesis: true difference in means between group rural and group urban is not equal to 0
## 95 percent confidence interval:
## -65.40138 -11.07862
## sample estimates:
## mean in group rural mean in group urban
## 105.6367 143.8767
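To see where the t, df and p-value in this output come from, the sketch below computes a pooled-variance (var.equal = TRUE) t-test by hand and compares it with t.test(). It uses simulated data, so the numbers are invented and will not match the output above, although df is 58 here too because each group has 30 observations.

```r
set.seed(42) # reproducible simulated data

# Simulated scores for two independent groups (invented values)
rural <- rnorm(30, mean = 52, sd = 13)
urban <- rnorm(30, mean = 41, sd = 13)

n1 <- length(rural)
n2 <- length(urban)

# Pooled variance: weighted average of the two sample variances
pooled_var <- ((n1 - 1) * var(rural) + (n2 - 1) * var(urban)) / (n1 + n2 - 2)

# t statistic: difference in means divided by its standard error
t_manual  <- (mean(rural) - mean(urban)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
df_manual <- n1 + n2 - 2

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(abs(t_manual), df = df_manual, lower.tail = FALSE)

# The same test using the x, y interface of t.test()
res <- t.test(x = rural, y = urban, var.equal = TRUE)

print(c(t_manual = t_manual, t_builtin = unname(res$statistic)))
print(c(p_manual = p_manual, p_builtin = res$p.value))
```

The x, y interface is used here partly as a precaution: some recent versions of R (4.4.0 onwards, as far as we are aware) stop with an error when the paired argument is supplied together with a formula, so it is worth checking how your own installation behaves.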
Figure 2: Exams are hard
This section is an extension activity if you have already finished the required materials. Please check with your tutor that you have a good grasp of the material before moving onto this section.
When performing statistical tests in the real world, we sometimes want to visualise our confidence intervals to see how precise our inferential estimates are. Let's also include both measures of central tendency, the mean and the median, on these visualisations.
Let's first extract the confidence interval and mean difference from
our t.test() function for the followers measure.
results <- t.test(formula = followers ~ urban, data = social_media_likes, var.equal = TRUE, paired = FALSE)
confidence_interval <- -rev(as.numeric(results$conf.int)) # t.test() reports rural minus urban; negate and reverse so the interval describes urban minus rural
CI_lower <- confidence_interval[1]
CI_upper <- confidence_interval[2]
mean_difference <- unname(results$estimate[2] - results$estimate[1]) # urban mean minus rural mean, matching the direction of the interval
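The object returned by t.test() is a list, so its pieces can be inspected by name. The sketch below uses simulated data (invented values) to show what is stored in conf.int and estimate, and how negating and reversing the interval re-expresses it as urban minus rural. That approach stays correct even when an interval spans zero, which taking absolute values would not.

```r
set.seed(7) # reproducible simulated data

# Simulated data: the group means and SDs are invented
toy <- data.frame(
  urban     = factor(rep(c("rural", "urban"), each = 20)),
  followers = c(rnorm(20, mean = 105, sd = 50), rnorm(20, mean = 145, sd = 50))
)

res <- t.test(followers ~ urban, data = toy, var.equal = TRUE)

# Names of the components stored in the result
print(names(res))

# Confidence interval for (rural mean - urban mean) and its confidence level
ci         <- as.numeric(res$conf.int)
conf_level <- attr(res$conf.int, "conf.level")

# Group means are stored in level order: rural first, then urban
rural_minus_urban <- unname(res$estimate[1] - res$estimate[2])

# Re-express everything as (urban mean - rural mean)
ci_flipped        <- -rev(ci)
urban_minus_rural <- -rural_minus_urban

print(c(lower = ci_flipped[1], estimate = urban_minus_rural, upper = ci_flipped[2]))
```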
Now let's calculate the median difference, as this is not produced by the
t.test() function.
median_difference <- social_media_likes %>%
group_by(urban) %>%
summarise(median_followers = median(followers)) %>%
summarise(diff = diff(median_followers)) %>%
pull(diff)
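The direction of this difference depends on the order of the factor levels: diff() subtracts the first value from the second, and with the levels ordered rural then urban that means urban minus rural. The sketch below checks this on a tiny invented dataset where the medians are easy to work out by eye.

```r
library(dplyr)

# Toy data (invented values): rural median is 100, urban median is 150
toy <- data.frame(
  urban     = factor(c("rural", "rural", "rural", "urban", "urban", "urban")),
  followers = c(80, 100, 120, 130, 150, 200)
)

# One median per group, in factor-level order (rural, then urban)
medians <- toy %>%
  group_by(urban) %>%
  summarise(median_followers = median(followers))

print(medians)

# diff() returns second minus first, i.e. urban minus rural
median_difference <- medians %>%
  summarise(diff = diff(median_followers)) %>%
  pull(diff)

print(median_difference)
```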
Now let's combine all this data into a single data frame.
plot_data <- data.frame(
mean_difference = c(mean_difference),
median_difference = c(median_difference),
lower = c(CI_lower),
upper = c(CI_upper)
)
Let's now plot this using ggplot.
# Plot
ggplot(data = plot_data, aes(x = "followers")) +
geom_point(aes(y = mean_difference), size = 4, colour = "green") +
geom_point(aes(y = median_difference), size = 4, colour = "red") +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1, na.rm = TRUE) + #plots error bars; ymin and ymax set the lower and upper ends of the bar respectively
labs(
x = NULL,
y = "Difference in Followers",
title = "Mean and Median Differences in Followers by Urban Group",
caption = "Mean includes 95% CI; Median shown without CI"
) +
theme_minimal()
Well done! This computing tutorial is now over. Make sure to thank your tutor for another amazing class full of wonderful statistics and learning!
Figure 3: Everyone loves statistics?